Measurement and Distribution of Index Quality in Research Topics from Academic Databases
Li Keyu, Wang Hao, Gong Lijuan, Tang Huihui
School of Information Management, Nanjing University, Nanjing 210023, China; Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
【目的】 对学术数据库中研究主题的索引术语的质量进行测度并探究其分布特点。【方法】 从Web of Science、CNKI中采集来自人文、社会和自然科学领域的研究主题的索引术语,构建主题、领域和数据库层次的术语空间,将术语区分能力(Term Discriminative Capacity,TDC)作为术语质量评价指标,采用ANOVA分析方法探究不同数据库、领域的研究主题的术语质量分布特点。【结果】 不同领域的研究主题的术语质量在字段分布上均满足:“Abstract”>平均水平>“Keyword”;CNKI的“Title”(Web of Science的“Keyword Plus”)与平均水平相比在不同领域中有所差异,但均低于“Abstract”;Web of Science的“Title”与“Abstract”相比在不同领域中有所差异,但均高于平均水平。【局限】 研究主题不够丰富。【结论】 TDC测度方法具有稳定性和可靠性;通过探究研究主题的术语质量分布特点,可以为选择检索字段入口和提高术语质量提供方向与依据。
[Objective] This paper measures the quality of index terms for research topics in academic databases and explores how that quality is distributed. [Methods] We collected index terms of research topics in the humanities, social sciences and natural sciences from Web of Science (WoS) and CNKI, and constructed term spaces at the topic, domain and database levels. We used term discriminative capacity (TDC) as the indicator of term quality and applied ANOVA to examine how term quality is distributed across databases and domains. [Results] In every domain, term quality by field followed the order “Abstract” > average level > “Keyword”. The CNKI “Title” field (and the WoS “Keyword Plus” field) compared differently with the average level from domain to domain, but was always lower than “Abstract”. The WoS “Title” field compared differently with “Abstract” from domain to domain, but was always higher than the average level. [Limitations] The set of research topics studied is not rich enough and needs to be expanded. [Conclusions] The TDC measure is stable and reliable. The distribution of term quality across fields offers guidance for choosing search fields and for improving term quality.
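The abstract does not give the TDC formula, so the following is only a minimal sketch of the general idea behind term discrimination, assuming the classic density-based discrimination value from the vector space model (Salton and colleagues) rather than the authors' own TDC definition. A term's value is the change in document-space density when the term is removed: a positive value means the term helps separate documents. The function names and the toy document vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def space_density(docs):
    """Average similarity of every document to the centroid of the space."""
    n_terms = len(docs[0])
    centroid = [sum(d[j] for d in docs) / len(docs) for j in range(n_terms)]
    return sum(cosine(d, centroid) for d in docs) / len(docs)


def discrimination_value(docs, k):
    """Density without term k minus density with all terms (assumed measure).

    Positive: removing the term packs documents closer together, so the
    term is a good discriminator. Negative: the term is a poor one.
    """
    without_k = [d[:k] + d[k + 1:] for d in docs]
    return space_density(without_k) - space_density(docs)


# Toy space (hypothetical): term 0 occurs in every document,
# terms 1-3 each occur in exactly one document.
docs = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
values = [discrimination_value(docs, k) for k in range(4)]
```

In this toy space the term shared by all documents receives a negative value and each document-specific term a positive one, which is the behaviour a quality indicator of this kind is meant to capture. Comparing such values across fields (title, abstract, keywords) would then be a separate statistical step, such as the ANOVA the abstract mentions.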
李轲禹, 王昊, 龚丽娟, 唐慧慧. 学术数据库中研究主题术语的质量测度及分布研究[J]. 数据分析与知识发现, 2020, 4(6): 91-108.
Li Keyu, Wang Hao, Gong Lijuan, Tang Huihui. Measurement and Distribution of Index Quality in Research Topics from Academic Databases[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 91-108.